
Turbine Tank Turbine Changes #437

Merged
merged 24 commits into from
Feb 29, 2024

Conversation

Contributor

@saienduri saienduri commented Feb 14, 2024

This commit adds the implementation for turbine tank. It currently lets us run llama, sd models, and 30 other models e2e and upload model artifacts to Azure. A folder has already been uploaded to the Azure tankturbine storage container using turbine tank, if you are interested in the structure and in how we track versions using the date + git_sha: container_link. The torch IR for every model is uploaded when the user chooses to upload using turbine tank. Accuracy against a torch run is also tested for every model we can run e2e, which lets us track the accuracy of our models and avoid regressions. There is also an option to download turbine tank model artifacts locally, so you don't have to run the models yourself; it will not redownload when cached artifacts are already present, and it will refresh local artifacts that are out of date. Further detail can be found in the code comments.
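As a rough illustration of the date + git_sha versioning scheme described above (the function names and folder layout here are assumptions for illustration, not the PR's actual code):

```python
# Hypothetical sketch of date + git_sha versioning; names and layout
# are illustrative assumptions, not the PR's actual implementation.
from datetime import date


def tank_version_prefix(git_sha: str, day: date) -> str:
    """Build an Azure folder name like '2024-02-29_fe20538'."""
    return f"{day.isoformat()}_{git_sha[:7]}"


def blob_path(version_prefix: str, model_name: str) -> str:
    """Place each model's torch IR under its versioned folder."""
    return f"{version_prefix}/{model_name}/model.torch_ir.mlir"
```

Combining the run date with the short commit SHA makes each upload both chronologically sortable and traceable back to the exact source revision.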

To run turbine tank:
1. Build turbine like normal.
2. Run `python models/turbine_models/turbine_tank/run_tank.py`

I will add the corresponding nightly CI job once this lands.
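The caching behavior described above (skip the download when the local artifact is current, refresh when it is stale) might be sketched like this; the `VERSION` marker file is a hypothetical convention for illustration, not necessarily how the PR tracks freshness:

```python
from pathlib import Path


def needs_download(model_dir: Path, remote_version: str) -> bool:
    """Return True when the local cache is missing or out of date.

    The VERSION marker file is an illustrative assumption; the real
    artifacts may record their version differently.
    """
    marker = model_dir / "VERSION"
    if not marker.exists():
        return True  # nothing cached yet: download
    # cached but stale: caller should re-download and refresh the marker
    return marker.read_text().strip() != remote_version
```

A check like this avoids repeated downloads of multi-gigabyte IR artifacts while still picking up new uploads automatically.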

Member

@dan-garvey dan-garvey left a comment


I think the upload is something only a builder should do, so exposing it as a command line flag is probably not ideal. Additionally, make sure the storage is publicly READ-only and privately writable.

Also, tests

models/turbine_models/turbine_tank/run_models.py (outdated; review thread resolved)
@dan-garvey
Member

@monorimet please ensure the sd stuff meets the modularity requirements you have on the frontend

@saienduri
Contributor Author

> I think the upload is something only a builder should do, so exposing it as a command line flag is probably not ideal. Additionally, make sure the storage is publicly READ-only and privately writable.
>
> Also, tests

Upload is not a command line flag anymore. Storage is publicly read-only. Everything also runs as a test now.

models/turbine_models/custom_models/sd_inference/vae.py (outdated; review thread resolved)
models/turbine_models/custom_models/sd_inference/unet.py (outdated; review thread resolved)
models/turbine_models/custom_models/sd_inference/clip.py (outdated; review thread resolved)

```python
storage_account_key = "XSsr+KqxBLxXzRtFv3QbbdsAxdwDGe661Q1xY4ziMRtpCazN8W6HZePi6nwud5RNLC5Y7e410abg+AStyzmX1A=="
storage_account_name = "tankturbine"
connection_string = "DefaultEndpointsProtocol=https;AccountName=tankturbine;AccountKey=XSsr+KqxBLxXzRtFv3QbbdsAxdwDGe661Q1xY4ziMRtpCazN8W6HZePi6nwud5RNLC5Y7e410abg+AStyzmX1A==;EndpointSuffix=core.windows.net"
```
Member


Probably don't want this public for security reasons; use a GitHub secret or something.

Contributor Author


Good point. I made them environment variables that we can pass in using GitHub secrets in our GitHub Actions workflow.
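A minimal sketch of that change, assuming an environment variable name like `AZURE_CONNECTION_STRING` (the actual variable names in the PR may differ):

```python
import os


def get_connection_string() -> str:
    # Read credentials from the environment (exported locally by the
    # developer, or injected by GitHub Actions secrets in CI) instead
    # of hardcoding them in source control.
    conn = os.environ.get("AZURE_CONNECTION_STRING")
    if not conn:
        raise RuntimeError(
            "AZURE_CONNECTION_STRING is not set; export it locally or "
            "configure the corresponding GitHub Actions secret."
        )
    return conn
```

Failing fast with a clear message when the variable is missing keeps misconfigured CI runs from producing confusing downstream Azure errors.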



```python
def preprocess_input_image(model_name):
    # from datasets import load_dataset
```
Member


lots of dead code

Contributor Author

@saienduri saienduri Feb 21, 2024


All of this code is used in test_tank.py: the different get methods generate test input and the torch output we compare against, and return an HF model to build/run based on the model_name we are running e2e. All the preprocessing and helper methods are necessary.

@saienduri
Contributor Author

saienduri commented Feb 22, 2024

By the way, the test turbine models job failing here is not due to this code (it passes with my venv, which hasn't been updated). With the latest dependencies (specifically transformers), we hit this issue: `TypeError: llama_pos_shift_attention_forward() got an unexpected keyword argument 'cache_position'`. I manually ran the test models job on the main branch, and it fails as well: https://github.com/nod-ai/SHARK-Turbine/actions/runs/7996420641. Specifically, the LlamaRotaryEmbedding definition has changed, and the seq_len argument is deprecated and unused in forward. I can pin the transformers dependency to a specific version until we figure this out, if we want.

@dan-garvey
Member

lint this when you get a chance

@saienduri saienduri changed the title Turbine Tank Turbine Tank Turbine Changes Feb 28, 2024
Member

@dan-garvey dan-garvey left a comment


Small nit I probably missed on a previous pass; other than that, looks good to land.

```python
with open(f"{safe_name}.mlir", "w+") as f:
    f.write(module_str)
model_name_upload = hf_model_name.replace("/", "_")
model_name_upload = model_name_upload + "-vae-" + variant
```
Member


nit: use underscores instead of dashes for consistency here and elsewhere

Contributor Author


The underscore is only used to separate the org and model name; the rest is all '-' (e.g. CompVis_stable-diffusion-v1-4-vae-decode).
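That convention (underscore only between org and model, dashes everywhere else) could be captured in one small helper; the function name here is hypothetical, not taken from the PR:

```python
def artifact_name(hf_model_name: str, component: str, variant: str) -> str:
    """Build an upload name following the convention discussed above.

    'CompVis/stable-diffusion-v1-4', 'vae', 'decode'
        -> 'CompVis_stable-diffusion-v1-4-vae-decode'
    """
    base = hf_model_name.replace("/", "_")  # org_model: underscore join
    return f"{base}-{component}-{variant}"  # dashes for the rest
```

Centralizing the name construction would also make the naming convention trivial to change consistently later if the dash/underscore choice is revisited.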

@saienduri saienduri merged commit fe20538 into nod-ai:main Feb 29, 2024
3 of 4 checks passed